Conversation

@chraac chraac commented Jul 6, 2025

Performance Optimization for Quantization Operations

Overview

This PR introduces significant performance optimizations for quantized neural network operations in the hexagon-npu device backend, focusing on improved memory management, vectorized operations, and enhanced data type support.

Key Changes

Performance Optimizations

  • Optimized dot product implementations with mixed-precision support (F16×F32); a simplified scalar sketch follows this list
  • Improved VTCM cache utilization and reduced memory allocations
  • Enhanced matrix multiplication with better loop structure and prefetching
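
As a rough illustration of the mixed-precision path, here is a minimal scalar sketch of an F16×F32 dot product, assuming F16 weights are widened to F32 before the multiply-accumulate. It is not the HVX-vectorized code in this PR, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar reference for the mixed-precision dot product:
// F16 weights are widened to F32 and multiply-accumulated against F32
// activations. The PR's actual kernels do this with Hexagon HVX vectors.

// Minimal F16 -> F32 bit conversion (subnormals flushed to zero to keep
// the sketch short).
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        bits = sign;                                      // zero / subnormal
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);          // Inf / NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); // re-bias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Scalar F16 x F32 dot product: widen each weight, then multiply-accumulate.
float dot_f16_f32(const uint16_t * w, const float * x, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += fp16_to_fp32(w[i]) * x[i];
    }
    return acc;
}
```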

Quantization Improvements

  • Implemented dual/quad block processing for Q4_0 and Q8_0 operations (a single-block scalar sketch follows this list)
  • Added specialized aligned/unaligned code paths
  • Added configurable F16/F32 dequantization targets
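
To make the block processing concrete, below is a minimal scalar sketch of the standard ggml Q4_0 block layout and its dequantization to F32. The struct matches ggml's Q4_0 format; the function name is hypothetical, and the real kernels in this PR operate on two or four blocks per iteration with HVX intrinsics, keep separate aligned/unaligned paths, and can emit F16 instead of F32.

```cpp
#include <cstdint>

// Q4_0 block layout as used by ggml: one F16 scale plus 32 values packed
// as 4-bit quants in 16 bytes (low nibble = element j, high nibble =
// element j + 16), each stored with an offset of 8.
constexpr int QK4_0 = 32;

struct block_q4_0 {
    uint16_t d;              // scale, stored as F16
    uint8_t  qs[QK4_0 / 2];  // packed 4-bit quants
};

float fp16_to_fp32(uint16_t h);  // helper from the previous sketch

// Minimal scalar dequantization of a single block to F32.
void dequant_block_q4_0(const block_q4_0 * b, float * y) {
    const float d = fp16_to_fp32(b->d);
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = (int)(b->qs[j] & 0x0F) - 8;  // elements 0..15
        const int hi = (int)(b->qs[j] >> 4)   - 8;  // elements 16..31
        y[j]             = lo * d;
        y[j + QK4_0 / 2] = hi * d;
    }
}
```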

Performance Impact

Performance benchmarks comparing the Hexagon NPU backend with the CPU backend across various operations show that the NPU/CPU ratio depends strongly on batch size and quantization format.

Matrix Multiplication Performance

| Operation | Dimensions | Hexagon NPU (GFLOPS) | CPU (GFLOPS) | NPU/CPU Ratio |
|---|---|---|---|---|
| MUL_MAT (q4_0) | n=1, k=14336 | 15.62 | 37.49 | 0.42x |
| MUL_MAT (q4_0) | n=4, k=14336 | 35.63 | 36.31 | 0.98x |
| MUL_MAT (q4_0) | n=8, k=14336 | 45.11 | 45.82 | 0.98x |
| MUL_MAT (q4_0) | n=512, k=14336 | 60.28 | 63.65 | 0.95x |
| MUL_MAT (q8_0) | n=1, k=14336 | 3.81 | 39.68 | 0.10x |
| MUL_MAT (q8_0) | n=512, k=14336 | 58.73 | 70.31 | 0.84x |
| MUL_MAT (f16) | n=1, k=14336 | 10.39 | 11.53 | 0.90x |
| MUL_MAT (f16) | n=512, k=14336 | 10.31 | 25.74 | 0.40x |
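
For context on the units: the GFLOPS values above follow the conventional 2·m·n·k FLOP count for a matrix multiply divided by measured wall time. The tiny sketch below shows that arithmetic with illustrative m and timing values (the actual shapes and timings are in the attached logs).

```cpp
#include <cstdio>

// Rough illustration of how the GFLOPS figures are derived: a MUL_MAT of
// an (m x k) matrix with a (k x n) matrix costs about 2*m*n*k floating
// point operations; dividing by the measured wall time gives FLOPS.
// The m and time values here are purely illustrative.
int main() {
    const double m = 4096.0;     // illustrative row count
    const double n = 512.0;      // batch size from the table
    const double k = 14336.0;    // inner dimension from the table
    const double seconds = 0.5;  // illustrative measured wall time
    const double gflops = 2.0 * m * n * k / seconds / 1e9;
    std::printf("%.2f GFLOPS\n", gflops);
    return 0;
}
```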

Attention Mechanism Performance

| Operation | Parameters | Hexagon NPU (GFLOPS) | CPU (GFLOPS) | NPU/CPU Ratio |
|---|---|---|---|---|
| FLASH_ATTN | hsk=64, nh=8, kv=4096 | 2.38 | 4.73 | 0.50x |
| FLASH_ATTN | hsk=128, nh=8, kv=4096 | 4.07 | 6.04 | 0.67x |
| FLASH_ATTN | hsk=128, nh=8, kv=16384 | 4.03 | 5.61 | 0.72x |

Elementary Operations

| Operation | Dimensions | Hexagon NPU (GB/s) | CPU (GB/s) | NPU/CPU Ratio |
|---|---|---|---|---|
| ADD | [4096,1,1,1] | 20.45 | 4.08 | 5.01x |
| ADD | [4096,512,1,1] | 25.68 | 19.14 | 1.34x |

Key Performance Insights

  1. Quantization Impact:
    • q4_0 consistently outperforms the other quantization methods on the NPU
    • At n=512, q4_0 (60.28 GFLOPS) slightly outperforms q8_0 (58.73 GFLOPS)
    • q4_K performs poorly compared to q4_0 across all batch sizes (see the attached logs)
  2. Relative to CPU:
    • The NPU excels at elementary vector operations (5x faster for small ADD operations)
    • The NPU reaches near-CPU throughput for matrix multiplication at large batch sizes
    • The CPU keeps an advantage for attention mechanisms across all tested configurations

test-backend-ops-perf-all.release.hexagon.51c53ae8f.log

test-backend-ops-perf-all.release.cpu.989772c7b.log

Unit tests

[hexagon-npu][ROPE][]supported, dst: f32[100x32x2], src0: f32[100x32x2], src1: i32[2], supported/unsupported: 1058/5194
[hexagon-npu][ROPE][]supported, dst: f32[100x32x2], src0: f32[100x32x2], src1: i32[2], supported/unsupported: 1059/5194
[hexagon-npu]Unsupported op: TRANSPOSE
[hexagon-npu][TRANSPOSE][ (reshaped) (transposed)]unsupported, dst: f32[2x3200], src0: f32[3200x2], supported/unsupported: 1059/5195
unload rpcmem lib successfully
  LLAMA(n_tokens=2): not supported [hexagon-npu] 
  6239/6239 tests passed
  Backend hexagon-npu: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

8gen2-test-backend-ops-all.debug.hexagon.51c53ae8f

chraac added 27 commits June 30, 2025 14:45
@github-actions github-actions bot added the build and ggml labels Jul 6, 2025
@chraac chraac closed this Jul 6, 2025